Large-scale strongly supervised ensemble metric learning

ABSTRACT

Systems and methods for metric learning include iteratively determining feature groups of images based on their derivative norms. Corresponding metrics of the feature groups are learned by gradient descent based on an expected loss. The corresponding metrics are combined to provide an intermediate metric matrix as a sparse representation of the images. A loss function of all metric parameters corresponding to features of the intermediate metric matrix is optimized, using a processor, to learn a final metric matrix. Eigenvalues of the final metric matrix are projected onto a simplex.

RELATED APPLICATION INFORMATION

This application claims priority to provisional application Ser. No. 61/562,102 filed on Nov. 21, 2011, incorporated herein by reference in its entirety.

BACKGROUND

1. Technical Field

The present invention relates to metric learning and more specifically to large-scale strongly supervised ensemble metric learning.

2. Description of the Related Art

The goal of metric learning is to find appropriate similarity measurements between pairs of instances that preserve a desired distance structure. Recently, many supervised metric learning methods have been proposed to learn Mahalanobis distance metrics for clustering or k-nearest neighbor classification. Supervised metric learning can be divided into two categories based upon supervision type. Weakly supervised metric learning learns metrics from directly provided pairwise constraints between instances. Such weak constraints are also known as side information. Strongly supervised metric learning receives explicit class labels assigned to every instance, from which a large number of constraints can be generated. While conventional metric learning methods perform well on data sets with a smaller number of features, they are very limited in tasks with high dimensional data. This is particularly true when using overcomplete representations of data, where high amounts of redundancy need to be carefully addressed.

SUMMARY

A method for metric learning includes iteratively determining feature groups of images based on their derivative norms. Corresponding metrics of the feature groups are learned by gradient descent based on an expected loss. The corresponding metrics are combined to provide an intermediate metric matrix as a sparse representation of the images. A loss function of all metric parameters corresponding to features of the intermediate metric matrix is optimized, using a processor, to learn a final metric matrix. Eigenvalues of the final metric matrix are projected onto a simplex.

A system for metric learning includes a sparse block diagonal metric ensembling module configured to iteratively determine feature groups of images based on their derivative norms, learn corresponding metrics of the feature groups by gradient descent based on an expected loss, and combine the corresponding metrics to provide an intermediate metric matrix as a sparse representation of the images. A joint metric learning module is configured to optimize, using a processor, a loss function of all metric parameters corresponding to features of the intermediate metric matrix to learn a final metric matrix. Eigenvalues of the final metric matrix are projected onto a simplex.

These and other features and advantages will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.

BRIEF DESCRIPTION OF DRAWINGS

The disclosure will provide details in the following description of preferred embodiments with reference to the following figures wherein:

FIG. 1 is a block/flow diagram showing a method for metric learning in accordance with one embodiment; and

FIG. 2 is a block/flow diagram showing a system for metric learning in accordance with one embodiment.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

In accordance with the present principles, systems and methods for large-scale strongly supervised ensemble metric learning are provided. Ensemble metric learning includes two consecutive steps: sparse block diagonal metric ensembling and joint metric learning. Sparse block diagonal metric ensembling selects effective features and learns their associated weak metrics, which correspond to diagonal blocks of a Mahalanobis matrix in the entire feature space. Joint metric learning learns another Mahalanobis matrix in the feature subspace enabled by the sparse block diagonal metric ensembling step by jointly considering the already selected features, with an optional low-rank constraint to pursue final representations of instances in an even lower space. Advantageously, large-scale strongly supervised ensemble metric learning is able to learn a sparse combination of features from an overcomplete feature set to achieve a very low-dimensional representation of each instance to facilitate, e.g., image verification and retrieval tasks.

Embodiments described herein may be entirely hardware, entirely software or including both hardware and software elements. In a preferred embodiment, the present invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.

Embodiments may include a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. A computer-usable or computer-readable medium may include any apparatus that stores, communicates, propagates, or transports the program for use by or in connection with the instruction execution system, apparatus, or device. The medium can be magnetic, optical, electronic, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. The medium may include a computer-readable storage medium such as a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk, etc.

A data processing system suitable for storing and/or executing program code may include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code to reduce the number of times code is retrieved from bulk storage during execution. Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) may be coupled to the system either directly or through intervening I/O controllers.

Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters.

Initially, symbols and notations that will be used throughout this discussion are provided. An instance is represented by $K$ feature groups as $x = [x^{(1)}, x^{(2)}, \ldots, x^{(K)}]^{T} \in \mathbb{R}^{D}$, $x^{(k)} \in \mathbb{R}^{d}$, where $x^{(k)}$ is the $k$-th feature group with $d$ features and the concatenated feature dimensionality is $D = Kd$.

A squared Mahalanobis distance metric is $d_{ij}^{A} = (x_i - x_j)^{T} A (x_i - x_j),\ \forall x_i, x_j \in \mathbb{R}^{D},\ A \geq 0$, where $A$ is a Mahalanobis matrix.
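For illustration only, the following minimal Python sketch (the function and variable names are hypothetical, not part of this disclosure) evaluates the squared Mahalanobis distance directly from this definition, forming A as LL^T so that A ≥ 0 holds by construction:

    import numpy as np

    def mahalanobis_sq(x_i, x_j, A):
        # d_ij^A = (x_i - x_j)^T A (x_i - x_j)
        diff = x_i - x_j
        return float(diff @ A @ diff)

    # Illustrative usage with D = 4; A = L L^T guarantees A is positive semi-definite.
    rng = np.random.default_rng(0)
    L = rng.standard_normal((4, 2))
    A = L @ L.T
    x_i, x_j = rng.standard_normal(4), rng.standard_normal(4)
    print(mahalanobis_sq(x_i, x_j, A))  # non-negative because A >= 0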

$\mathbb{B}^{D \times D}$ is the block matrix space in which matrices consist of $K \times K$ blocks, each of size $d \times d$.

$B_{kl}$ is the sparse block matrix space where only the block in the $k$-th row and the $l$-th column is non-zero.

$\lfloor A \rfloor_{kl}$ is the projection of matrix $A$ onto $B_{kl}$.

$\|A\|_{F}$, $tr(A)$ and $r(A)$ are the Frobenius norm, trace norm and rank of $A$.

$\|A\|_{S0} = card\{k \mid \lfloor A \rfloor_{kl} \neq 0 \text{ or } \lfloor A \rfloor_{lk} \neq 0, \exists l\}$ is the number of feature groups used by $A$ (i.e., the defined structural $l^{0}$ norm of $A$).

$\Pi_{PSD}(A)$ is the projection of $A$ onto the positive semi-definite space; $\Pi_{v}(A)$ is the simplex projection that makes its trace norm lower than $v$.

$x_i \sim x_j$ or $\pi_{ij} = +1$ denotes that $x_i$ and $x_j$ are of the same category; $x_i \nsim x_k$ or $\pi_{ik} = -1$ denotes that they are of different categories.

$N = |\chi|$, $N_{i}^{+} = |\{x_j \mid x_j \sim x_i, x_j \in \chi\}|$ and $N_{i}^{-} = |\{x_k \mid x_k \nsim x_i, x_k \in \chi\}|$ are the total number of training samples, the number of same-category samples, and the number of different-category samples relative to $x_i$, respectively.
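As a concrete, illustrative reading of the block notation above, the following Python sketch (hypothetical helper names, assuming K groups of equal size d) extracts the (k, l) block of A and counts the feature groups A uses in the sense of the structural l⁰ norm:

    import numpy as np

    def block(A, k, l, d):
        # Return the d x d block of A in the k-th block row and l-th block column.
        return A[k * d:(k + 1) * d, l * d:(l + 1) * d]

    def structural_l0(A, K, d, tol=1e-12):
        # Count feature groups k whose row or column blocks of A are non-zero.
        used = 0
        for k in range(K):
            row = any(np.linalg.norm(block(A, k, l, d)) > tol for l in range(K))
            col = any(np.linalg.norm(block(A, l, k, d)) > tol for l in range(K))
            used += int(row or col)
        return used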

Consider the situation where instances are represented by a large collection of fixed-size feature groups, without loss of generality to cases with varying-size feature groups. These feature groups could be subspaces of the original feature, or wavelet descriptors at different positions and scales, such as, e.g., scale-invariant feature transform (SIFT) and local binary pattern (LBP) features. Due to the huge redundancy in overcomplete representations, a desired metric should avoid using feature groups with little discriminability so as to estimate similarities between instances very efficiently without loss of accuracy. As such, the metric learning may be formulated as follows:

$\begin{matrix}\begin{matrix}{\min\limits_{A}} & {f(A \mid \chi) = \frac{\lambda}{2}\|A\|_{F}^{2} + l(A \mid \chi)} \\ {\text{subject to}} & {A \geq 0,\ \|A\|_{S0} \leq \mu,\ tr(A) \leq v}\end{matrix} & (1)\end{matrix}$

in which $l(A \mid \chi)$ is the empirical loss function regarding the discriminability of the Mahalanobis matrix $A$ on the training set $\chi$. The regularization term penalizes matrix $A$ by its squared Frobenius norm with coefficient $\lambda$ for better generalization ability; $A \geq 0$ keeps the learned metric satisfying the triangle inequality; $tr(A) \leq v$ yields a low-rank matrix $A$ so that every instance can eventually be represented in a low-dimensional space; and, in particular, $\|A\|_{S0} \leq \mu$ imposes group sparsity on matrix $A$ to ensure that only a limited number of feature groups (fewer than $\mu$) are actually involved.

However, the optimization task in Equation (1) is NP-hard due to the structural $l^{0}$ norm and, thus, extremely difficult to solve with high dimensional overcomplete representations of data. Referring now to FIG. 1, a block/flow diagram showing a method for metric learning 100 is illustratively depicted in accordance with one embodiment. In block 102, sparse block diagonal metric ensembling is performed. Pseudocode 1 illustratively depicts sparse block diagonal metric ensembling in accordance with one embodiment.

  Pseudocode 1: Sparse block diagonal metric ensembling.
  Input: χ, μ, and λ
  A ← 0
  for t = 1 to μ do
    κ = argmax_{κ ∈ 1,2,...,K} ‖Π_PSD(−⌊∂f(A|χ)/∂A⌋_κκ)‖₂
    A_κ*, α* = argmin_{A_κ ≥ 0, A_κ ∈ B_κκ, α ∈ ℝ⁺} f(αA + A_κ | χ)
    A ← α*A + A_κ*
  end for
  A_† = A, L_† = U where UΛU^T = A, U ∈ ℝ^(D×D_†)
  Output: A_† and L_†
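A minimal Python sketch of this greedy ensembling loop is given below for illustration. It assumes a user-supplied gradient function grad_f(A) for the loss f(A|χ) and simplifies the inner step of Pseudocode 1: instead of jointly solving for (A_κ*, α*), it refines the selected diagonal block by fixed-step projected gradient descent, so it should be read as a sketch rather than the disclosed procedure:

    import numpy as np

    def psd_project(M):
        # Project a symmetric matrix onto the positive semi-definite cone.
        w, V = np.linalg.eigh((M + M.T) / 2.0)
        return (V * np.maximum(w, 0.0)) @ V.T

    def ensemble_block_diagonal(grad_f, K, d, mu, steps=50, lr=1e-2):
        # Greedy selection of mu diagonal blocks (simplified; no alpha rescaling).
        D = K * d
        A = np.zeros((D, D))
        for _ in range(mu):
            G = grad_f(A)
            # Pick the block kappa whose projected negative gradient is largest.
            norms = [np.linalg.norm(psd_project(-G[k * d:(k + 1) * d, k * d:(k + 1) * d]))
                     for k in range(K)]
            k = int(np.argmax(norms))
            sl = slice(k * d, (k + 1) * d)
            # Learn the weak metric on block kappa by projected gradient descent.
            for _ in range(steps):
                A[sl, sl] = psd_project(A[sl, sl] - lr * grad_f(A)[sl, sl])
        return A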

Starting with an empty feature group set (A=0), in block 104, effective feature groups are iteratively determined (indicated by κ). Effective feature groups refer to the groups of features most likely to reduce the loss objective value. Preferably, the criterion is the largest derivative norm of the loss objective function. In each iteration, the k-th feature group is determined as the effective feature group (i.e., the feature group having the largest derivative norm). The negative of the partial derivative matrix is projected onto the positive semi-definite space so that it decreases the loss function while keeping the updated matrix positive semi-definite. In block 106, weak metrics (A_κ*) corresponding to the effective feature groups are learned as the metrics with the smallest expected loss. Weak metrics refer to the metrics learned in each iteration. Every candidate feature group is evaluated by the partial derivative of the loss function ƒ(•) with respect to its corresponding diagonal block in matrix A. Preferably, the corresponding weak metrics are learned by gradient descent.

In block 108, the corresponding weak metrics are combined into a strong metric to provide an intermediate Mahalanobis distance metric matrix, A_†. The strong metric refers to the combination of all weak metrics learned over the iterations. Sparse block diagonal metric ensembling selects the diagonal block with the largest l² norm of the projected partial derivative matrix and optimizes it together with a scale factor α that adjusts the previously learned matrix to minimize the loss function. After μ iterations of weak metric learning, an intermediate Mahalanobis distance metric, A_†, is obtained with at most μ feature groups involved, whose orthogonal linear transformation matrix L_† preliminarily reduces the feature dimensionality from D to D_† (D_† << D).

In block 110, joint metric learning is then performed. The goal of joint metric learning is to learn a better Mahalanobis metric than the one determined by sparse block diagonal metric ensembling 102, using the corresponding features of the intermediate distance metric. Owing to the supervised dimension reduction achieved by sparse block diagonal metric ensembling in block 102, joint metric learning is capable of further exploiting correlations between selected feature groups in the intermediate feature space χ_† without diagonal block constraints. The projected gradient descent method may be adopted to solve this optimization program. Pseudocode 2 illustratively depicts joint metric learning in accordance with one embodiment.

  Pseudocode 2: Joint metric learning.
  Input: χ, v, λ, and U_†
  Dimension reduction: χ_† = {U_†^T x | x ∈ χ}
  A ← 0
  while not converged do
    ∇f(A|χ_†) = ∂f(A|χ_†)/∂A
    choose a proper step γ by line search
    A ← Π_v(A − γ∇f(A|χ_†))
  end while
  L_‡ = L_† L where LL^T = A
  A_‡ = L_‡ L_‡^T
  Output: A_‡ and L_‡
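For illustration, a simplified Python sketch of this projected gradient loop follows. It assumes the instances have already been mapped by U_†^T, takes the gradient of f(A|χ_†) and the projection Π_v as user-supplied callables (one possible eigenvalue-based Π_v is sketched further below), and uses a fixed step size instead of the line search of Pseudocode 2:

    import numpy as np

    def joint_metric_learning(grad_f, project_v, D_red, iters=100, lr=1e-2):
        # grad_f(A): gradient of f(A | chi_dagger) in the reduced space.
        # project_v(A): projection keeping A PSD with trace at most v.
        A = np.zeros((D_red, D_red))
        for _ in range(iters):
            A = project_v(A - lr * grad_f(A))
        # Factor A = L L^T to obtain the secondary linear transformation L.
        w, V = np.linalg.eigh(A)
        L = V * np.sqrt(np.maximum(w, 0.0))
        return A, L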

In block 112, a loss function of all metric parameters corresponding to the selected effective feature groups is iteratively optimized by gradient descent with a proper step size. The term "all metric parameters" is used to distinguish these parameters from those in sparse block diagonal metric learning 102. In sparse block diagonal metric learning 102, only the parameters within each weak metric (i.e., each feature group) are tuned. In other words, the parameters across different feature groups are set to zero, and thus the metric parameters form a block diagonal matrix. In joint metric learning 110, all metric parameters are tuned (i.e., they form a full matrix). In a preferred embodiment, the selected effective feature groups include the corresponding features of the intermediate metric. The proper step size is preferably determined by a line search method; however, other methods are also contemplated. Gradient descent may include any method of gradient descent.

In block 114, the Mahalanobis matrix is regulated by projecting its eigenvalues onto a simplex, so as to satisfy tr(A) ≤ v and A ≥ 0, to learn a final metric matrix with low-rank regularization. In this way, the joint metric learning method may learn a secondary linear transformation matrix L to map instances onto an even lower dimensional space. The final linear transformation matrix L_‡ = L_† L helps represent all instances in a D_‡-dimensional space, where Euclidean distance is the optimal metric for similarity measurement. In other words, A_‡ = L_‡ L_‡^T is the final Mahalanobis matrix. Low-rank regularization means that the metric parameter matrix (i.e., using all metric parameters) should be a low-rank matrix. To obtain a low-rank matrix from a general full matrix, the present principles perform a projection (i.e., projecting eigenvalues onto a simplex). The simplex refers to N non-negative numbers whose sum is one. By projecting eigenvalues onto a simplex, many eigenvalues are forced to be zeros, thereby providing a low-rank matrix by composing the matrix eigenvectors and the projected eigenvalues.
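One common way to realize such an eigenvalue projection is the sort-based simplex projection; the Python sketch below (an illustrative assumption, not necessarily the exact projection used in the disclosure) clips negative eigenvalues and, when the remaining eigenvalue sum still exceeds v, projects the eigenvalues onto the scaled simplex before recomposing the matrix:

    import numpy as np

    def project_simplex(vals, v):
        # Project a vector onto {p : p_i >= 0, sum(p) <= v}.
        clipped = np.maximum(vals, 0.0)
        if clipped.sum() <= v:
            return clipped
        u = np.sort(clipped)[::-1]
        css = np.cumsum(u)
        rho = np.nonzero(u - (css - v) / (np.arange(len(u)) + 1) > 0)[0][-1]
        theta = (css[rho] - v) / (rho + 1.0)
        return np.maximum(clipped - theta, 0.0)

    def project_metric(A, v):
        # Make A positive semi-definite with trace at most v via its eigenvalues.
        w, V = np.linalg.eigh((A + A.T) / 2.0)
        w = project_simplex(w, v)
        return (V * w) @ V.T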

The computation of the empirical loss function l(A|χ) and of its gradient, which are defined by constraints between instances, may be important steps in this method 100. From training data with explicit class labels, two types of constraints can be generated: pairwise and triplet. For example, let x_i and x_j be two instances of the same category and x_k be an instance of another category. From the viewpoint of x_i, on one side, pairwise constraints are d_ij^A < θ and d_ik^A > θ, where θ is a general threshold separating all similar pairs from dissimilar ones. Constraints of this type are adopted in verification problems that determine whether a pair of instances belongs to the same category or not. On the other side, the triplet constraint is d_ij^A ≤ d_ik^A, which is a ranking preference designed for clustering or retrieval tasks that are concerned with the relative difference of distances between instances.
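For illustration, the following Python sketch (hypothetical helpers) enumerates both constraint types from class labels; the explicit triplet enumeration is shown only to make the O(N³) count concrete, since the loss developed below never needs to materialize the triplets:

    import numpy as np

    def pairwise_labels(y):
        # pi_ij = +1 for same-category pairs, -1 for different-category pairs.
        y = np.asarray(y)
        return np.where(y[:, None] == y[None, :], 1, -1)

    def triplet_constraints(y):
        # All (i, j, k) with y_i == y_j (j != i) and y_i != y_k,
        # i.e. triplets for which d_ij^A <= d_ik^A is desired.
        y = np.asarray(y)
        return [(i, j, k)
                for i in range(len(y))
                for j in range(len(y)) if j != i and y[j] == y[i]
                for k in range(len(y)) if y[k] != y[i]]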

The empirical error of A with threshold θ on all pairwise constraints from χ is defined by:

$\begin{matrix}{\epsilon_{\theta}(A \mid \chi) = \Pr\left( \pi_{ij}\left( d_{ij}^{A} - \theta \right) > 0 \mid x_i, x_j \in \chi \right) = E_{x_i, x_j \in \chi}\, 1_{\pi_{ij}\left( d_{ij}^{A} - \theta \right) > 0}} & (2)\end{matrix}$

in which $\pi_{ij} = \pm 1$ indicates whether $x_i$ and $x_j$ belong to the same category or not, and $1_{(\cdot)}$ is the characteristic function that outputs 1 if $(\cdot)$ is satisfied and 0 otherwise. By replacing $1_{(\cdot)}$ with the exponential-based logit surrogate function,

${\psi_{\beta}(e^{z}) = \frac{\ln\left( 1 + \beta e^{z} \right)}{\ln\left( 1 + \beta \right)}},$

and setting $\beta = 1$, the upper bound of the empirical error is obtained as follows.

$\begin{matrix}{\epsilon_{\theta}(A \mid \chi) \leq E_{x_i, x_j \in \chi}\, \psi_{1}\left( e^{\pi_{ij}\left( d_{ij}^{A} - \theta \right)} \right) = \frac{1}{N^{2}\ln 2} \sum\limits_{i,j} \ln\left( 1 + e^{\pi_{ij}\left( d_{ij}^{A} - \theta \right)} \right) = l_{\theta}(A \mid \chi)} & (3)\end{matrix}$

which is smooth and convex, serving as the empirical loss function with pairwise constraints.

Let $\eta_{ij} = d_{ij}^{A} - \theta$. Applying the chain rule results in the following:

$\begin{matrix}{\frac{\partial l_{\theta}(A \mid \chi)}{\partial A} = \sum\limits_{i,j} \frac{\partial l_{\theta}(A \mid \chi)}{\partial \eta_{ij}} \cdot \frac{\partial \eta_{ij}}{\partial A} = \sum\limits_{i,j} w_{ij}\,(x_i - x_j)(x_i - x_j)^{T}} & (4)\end{matrix}$

in which the weight term is:

$\begin{matrix}{w_{ij} = \frac{\partial l_{\theta}(A \mid \chi)}{\partial \eta_{ij}} = \frac{\pi_{ij}}{N^{2}\ln 2} \cdot \frac{e^{\pi_{ij}\left( d_{ij}^{A} - \theta \right)}}{1 + e^{\pi_{ij}\left( d_{ij}^{A} - \theta \right)}}} & (5)\end{matrix}$

Given the weight matrix $W = \{w_{ij}\}_{N \times N}$, equation (4) can be efficiently computed by:

$\begin{matrix}{\frac{\partial l_{\theta}(A \mid \chi)}{\partial A} = X\left( S - W - W^{T} \right) X^{T}} & (6)\end{matrix}$

where $X = [x_1, x_2, \ldots, x_N]$ is the feature matrix of $\chi$ and $S = diag\left( \sum_{k} w_{1k} + w_{k1}, \ldots, \sum_{k} w_{Nk} + w_{kN} \right)$.
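A compact Python sketch of the pairwise loss of equation (3) and the efficient gradient form of equation (6) is given below for illustration (names are hypothetical; X holds instances as columns; pi is the ±1 matrix of π_ij; terms with i = j only add a constant to the loss and do not affect the gradient):

    import numpy as np

    def pairwise_loss_and_grad(A, X, pi, theta):
        # l_theta(A | chi) of Eq. (3) and its gradient via Eq. (5)-(6).
        N = X.shape[1]
        AX = A @ X
        q = np.sum(X * AX, axis=0)
        D2 = q[:, None] + q[None, :] - 2.0 * (X.T @ AX)      # all d_ij^A
        z = pi * (D2 - theta)
        loss = np.sum(np.logaddexp(0.0, z)) / (N * N * np.log(2.0))
        W = pi / (N * N * np.log(2.0)) / (1.0 + np.exp(-z))  # weights of Eq. (5)
        S = np.diag(W.sum(axis=1) + W.sum(axis=0))
        grad = X @ (S - W - W.T) @ X.T                       # Eq. (6)
        return loss, grad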

The empirical error of A on all triplet constraints from χ is defined as follows.

$\begin{matrix}{\epsilon(A \mid \chi) = \Pr\left( d_{ij}^{A} > d_{ik}^{A} \mid x_j \sim x_i, x_k \nsim x_i \right) = E_{x_i,\, x_j \sim x_i,\, x_k \nsim x_i}\, 1_{d_{ij}^{A} > d_{ik}^{A}}} & (7)\end{matrix}$

Similarly, the upper bound of this empirical error is as follows.

$\begin{matrix}{\epsilon(A \mid \chi) \leq E_{x_i,\, x_j \sim x_i,\, x_k \nsim x_i}\, \psi_{\beta}\left( e^{d_{ij}^{A} - d_{ik}^{A}} \right) = \bar{l}(A \mid \chi)} & (8)\end{matrix}$

However, this is not an appropriate loss function, as its computational complexity given $\{d_{ij}^{A} \mid \forall i, j\}$ could be $O(N^{3})$. By using the concavity of $\psi_{\beta}(\cdot)$, it is further relaxed as follows.

$\begin{matrix}{\bar{l}(A \mid \chi) \leq E_{x_i}\, \psi_{\beta}\left( E_{x_j \sim x_i,\, x_k \nsim x_i}\, e^{d_{ij}^{A} - d_{ik}^{A}} \right) = E_{x_i}\, \psi_{\beta}\left( E_{x_j \sim x_i}\, e^{d_{ij}^{A}} \cdot E_{x_k \nsim x_i}\, e^{-d_{ik}^{A}} \right) = E_{x_i}\, \psi_{\beta}\left( \phi_{i}^{+} \cdot \phi_{i}^{-} \right) = l(A \mid \chi)} & (9)\end{matrix}$

where

$\begin{matrix}{\phi_{i}^{+} = E_{x_j \sim x_i}\, e^{d_{ij}^{A}} = \frac{1}{N_{i}^{+}} \sum\limits_{x_j \sim x_i} e^{d_{ij}^{A}}, \qquad \phi_{i}^{-} = E_{x_k \nsim x_i}\, e^{-d_{ik}^{A}} = \frac{1}{N_{i}^{-}} \sum\limits_{x_k \nsim x_i} e^{-d_{ik}^{A}}} & (10)\end{matrix}$

Equation (9) is a loss function holding the upper bound of the empirical error with all triplet constraints generated from $\chi$. Its computational complexity given $\{d_{ij}^{A} \mid \forall i, j\}$ is just $O(N^{2})$, the same as that with pairwise constraints in equation (3).

Similar to equations (4) and (5) for pairwise constraints, the loss function is reformulated as follows.

$\begin{matrix}{\frac{\partial l(A \mid \chi)}{\partial A} = \sum\limits_{i,j} w_{ij}\,(x_i - x_j)(x_i - x_j)^{T}, \quad \text{where} \quad w_{ij} = \frac{\beta \phi_{i}^{+}\, e^{d_{ij}^{A}}}{N N_{i}^{+} \ln(1 + \beta) \cdot \left( 1 + \beta \phi_{i}^{+} \phi_{i}^{-} \right)} \ \text{for}\ x_j \sim x_i, \quad \text{and} \quad w_{ij} = -\frac{\beta \phi_{i}^{-}\, e^{-d_{ij}^{A}}}{N N_{i}^{-} \ln(1 + \beta) \cdot \left( 1 + \beta \phi_{i}^{+} \phi_{i}^{-} \right)} \ \text{for}\ x_j \nsim x_i.} & (11)\end{matrix}$
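The O(N²) structure of the relaxed loss is illustrated by the Python sketch below, which computes φ_i^+, φ_i^-, and l(A|χ) of equations (9)-(10) from all pairwise distances (hypothetical names; `same` is a boolean N×N matrix marking same-category pairs; the gradient weights of equation (11) are omitted for brevity):

    import numpy as np

    def relaxed_triplet_loss(A, X, same, beta=1.0):
        # l(A | chi) of Eq. (9)-(10), computed from pairwise distances only.
        AX = A @ X
        q = np.sum(X * AX, axis=0)
        D2 = q[:, None] + q[None, :] - 2.0 * (X.T @ AX)   # all d_ij^A, O(N^2)
        same = same.copy()
        np.fill_diagonal(same, False)                     # drop the trivial pair (i, i)
        diff = ~same
        np.fill_diagonal(diff, False)
        phi_p = np.where(same, np.exp(D2), 0.0).sum(axis=1) / same.sum(axis=1)   # phi_i^+
        phi_m = np.where(diff, np.exp(-D2), 0.0).sum(axis=1) / diff.sum(axis=1)  # phi_i^-
        return float(np.mean(np.log1p(beta * phi_p * phi_m)) / np.log1p(beta))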

Advantageously, the metric learning method 100 is able to learn a sparse combination of features from an overcomplete feature set to achieve a very low-dimensional representation of every instance to facilitate, e.g., image verification and retrieval tasks. The method 100 preserves good discriminability to distinguish objects of different categories with few computational resources, which may be important in processing large data sets.

Referring now to FIG. 2, a block/flow diagram showing a system for metric learning 200 is illustratively depicted in accordance with one embodiment. The metric learning system 202 preferably includes one or more processors 212 and memory 206 for storing programs and applications. It should be understood that the functions and components of system 202 may be integrated into one or more systems.

The system 202 may include a display 208 for viewing. The display 208 may also permit a user to interact with the system 202 and its components and functions. This is further facilitated by a user interface 210, which may include a keyboard, mouse, joystick, or any other peripheral or control to permit user interaction with the system 202.

The system 202 may receive an input 204, such as, e.g., input images and an image database. The input images preferably include training images. Memory 206 may include a sparse block diagonal metric ensembling module 214. Metric ensembling module 214 sequentially selects a set of features to constitute a sparse representation of an image from an overcomplete feature set. Metric ensembling module 214 starts from an empty feature group set, progressively chooses effective feature groups, learns weak metrics that correspond to diagonal blocks of a Mahalanobis matrix in the entire feature space, and combines the weak metrics into a strong metric to provide an intermediate Mahalanobis distance metric. Metric ensembling module 214 determines an optimal combination of simple features to pursue a low cost in coding every image (e.g., the input image and images of the image database).

Memory 206 may also include a joint metric learning module 216 configured to further reduce linear dimensionality to maximize the discriminability between images of different people while minimizing the distance between images of the same person. In this way, each face image can be represented by a low-dimensional vector and, e.g., a Euclidean distance measures the dissimilarity between them. The joint metric learning module 216 learns another Mahalanobis matrix in the feature subspace enabled by the metric ensembling module 214 by jointly considering the already selected features, with an optional low-rank constraint to pursue a final representation of instances in an even lower space. The joint metric learning module 216 iteratively optimizes a loss function of all metric parameters corresponding to features of the intermediate metric matrix and regulates it by projecting its eigenvalues onto a simplex to provide a final metric matrix with low-rank regularization. The output 218 of the metric learning system 202 may include the final metric matrix.

Having described preferred embodiments of a system and method for large-scale strongly supervised ensemble metric learning (which are intended to be illustrative and not limiting), it is noted that modifications and variations can be made by persons skilled in the art in light of the above teachings. It is therefore to be understood that changes may be made in the particular embodiments disclosed which are within the scope of the invention as outlined by the appended claims. Having thus described aspects of the invention, with the details and particularity required by the patent laws, what is claimed and desired protected by Letters Patent is set forth in the appended claims.

What is claimed is:
 1. A method for metric learning, comprising: iteratively determining feature groups of images based on their derivative norms; learning corresponding metrics of the feature groups by gradient descent based on an expected loss; combining the corresponding metrics to provide an intermediate metric matrix as a sparse representation of the images; determining

$\kappa = \underset{\kappa \in 1,2,\ldots,K}{\arg\max}\ \left\| \Pi_{PSD}\left( -\left\lfloor \frac{\partial f(A \mid \chi)}{\partial A} \right\rfloor_{\kappa\kappa} \right) \right\|_{2}, \qquad A_{\kappa}^{*}, \alpha^{*} = \underset{A_{\kappa} \geq 0,\, A_{\kappa} \in B_{\kappa\kappa},\, \alpha \in \mathbb{R}^{+}}{\arg\min} f(\alpha A + A_{\kappa} \mid \chi), \qquad A \rightarrow \alpha^{*} A + A_{\kappa}^{*}$

where the $K$ feature groups comprise $x = [x^{(1)}, x^{(2)}, \ldots, x^{(K)}]^{T} \in \mathbb{R}^{D}$, $x^{(k)} \in \mathbb{R}^{d}$, and where $x^{(k)}$ is the $k$-th feature group with $d$ features and the concatenated feature dimensionality $D = Kd$, with Mahalanobis matrix $A$, training set $\chi$, and weak metrics $A_{\kappa}^{*}$ corresponding to effective feature groups; optimizing, using a processor, a loss function of all metric parameters corresponding to features of the intermediate metric matrix to learn a final metric matrix; and projecting eigenvalues of the final metric matrix onto a simplex.
 2. The method as recited in claim 1, wherein iteratively determining feature groups of images includes projecting an opposite of a partial derivative matrix onto positive semi-definite space.
 3. The method as recited in claim 1, wherein iteratively determining feature groups of images includes evaluating each feature group by a partial derivative of a loss function.
 4. The method as recited in claim 1, wherein optimizing the loss function includes optimizing the loss function of all metric parameters corresponding to features of the intermediate metric matrix by gradient descent.
 5. The method as recited in claim 4, wherein optimizing the loss function includes determining a step size by line search.
 6. A system for metric learning, comprising: a sparse block diagonal metric ensembling module configured to iteratively determine feature groups of images based on their derivative norms, learn corresponding metrics of the feature groups by gradient descent based on an expected loss, and combine the corresponding metrics to provide an intermediate metric matrix as a sparse representation of the images using:

$\kappa = \underset{\kappa \in 1,2,\ldots,K}{\arg\max}\ \left\| \Pi_{PSD}\left( -\left\lfloor \frac{\partial f(A \mid \chi)}{\partial A} \right\rfloor_{\kappa\kappa} \right) \right\|_{2}, \qquad A_{\kappa}^{*}, \alpha^{*} = \underset{A_{\kappa} \geq 0,\, A_{\kappa} \in B_{\kappa\kappa},\, \alpha \in \mathbb{R}^{+}}{\arg\min} f(\alpha A + A_{\kappa} \mid \chi), \qquad A \rightarrow \alpha^{*} A + A_{\kappa}^{*}$

where the $K$ feature groups include $x = [x^{(1)}, x^{(2)}, \ldots, x^{(K)}]^{T} \in \mathbb{R}^{D}$, $x^{(k)} \in \mathbb{R}^{d}$, and where $x^{(k)}$ is the $k$-th feature group with $d$ features and the concatenated feature dimensionality $D = Kd$, with Mahalanobis matrix $A$, training set $\chi$, and weak metrics $A_{\kappa}^{*}$ corresponding to effective feature groups; and a joint metric learning module configured to optimize, using a processor, a loss function of all metric parameters corresponding to features of the intermediate metric matrix to learn a final metric matrix, and project eigenvalues of the final metric matrix onto a simplex.
 7. The system as recited in claim 6, wherein the sparse block diagonal metric ensembling module is further configured to project an opposite of a partial derivative matrix onto positive semi-definite space.
 8. The system as recited in claim 6, wherein the sparse block diagonal metric ensembling module is further configured to evaluate each feature group by a partial derivative of a loss function.
 9. The system as recited in claim 6, wherein the joint metric learning module is further configured to optimize the loss function of all metric parameters corresponding to features of the intermediate metric matrix by gradient descent.
 10. The system as recited in claim 9, wherein the joint metric learning module is further configured to determine a step size by line search.